Application of simultaneous uncertainty quantification and segmentation for oropharyngeal cancer use-case with Bayesian deep learning

Background: Radiotherapy is a core treatment modality for oropharyngeal cancer (OPC), where the primary gross tumor volume (GTVp) is manually segmented with high interobserver variability. This calls for reliable and trustworthy automated tools in the clinician workflow. Therefore, accurate uncertainty quantification and its downstream utilization are critical. Methods: Here we propose uncertainty-aware deep learning for OPC GTVp segmentation, and illustrate the utility of uncertainty in multiple applications. We examine two Bayesian deep learning (BDL) models and eight uncertainty measures, and utilize a large multi-institute dataset of 292 PET/CT scans to systematically analyze our approach. Results: We show that our uncertainty-based approach accurately predicts the quality of the deep learning segmentation in 86.6% of cases, identifies low performance cases for semi-automated correction, and visualizes regions of the scans where the segmentations likely fail. Conclusions: Our BDL-based analysis provides a first step towards more widespread implementation of uncertainty quantification in OPC GTVp segmentation.

In terms of processing time for auto-segmentation of the OPC GTVp, the method provides a substantial improvement over manual segmentation: processing each scan takes on average 2 and 8 seconds with the Deep Ensemble and MC Dropout Ensemble, respectively, on a machine with an RTX 2080 Ti, while expert delineation can take from 30 minutes to two hours depending on the size and shape of the tumor (as noted by Outeiral et al. 3). Using the proposed framework in Figure 1, when the most uncertain cases are referred to medical experts and the rest are automated, the method can improve the average performance over interobserver variability in terms of DSC while reducing the overall processing time when evaluated on this dataset.
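The referral idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `refer_most_uncertain`, the per-case scalar uncertainty, the referral fraction, and the assumed expert DSC are all hypothetical choices made for illustration, and the numbers are toy values rather than results from the dataset.

```python
from statistics import mean


def refer_most_uncertain(dsc, uncertainty, refer_fraction, expert_dsc):
    """Simulate referring the most uncertain cases to an expert.

    dsc            -- per-case DSC of the automated segmentation
    uncertainty    -- per-case scalar uncertainty (higher = less certain)
    refer_fraction -- fraction of cases sent to an expert, in [0, 1]
    expert_dsc     -- assumed DSC for referred cases (hypothetical value)

    Returns the mean DSC after referral and the sorted referred indices.
    """
    if len(dsc) != len(uncertainty):
        raise ValueError("dsc and uncertainty must have the same length")
    n_refer = round(refer_fraction * len(dsc))
    # Case indices ordered from most to least uncertain.
    order = sorted(range(len(dsc)), key=lambda i: uncertainty[i], reverse=True)
    referred = set(order[:n_refer])
    # Referred cases take the assumed expert DSC; the rest stay automated.
    combined = [expert_dsc if i in referred else dsc[i] for i in range(len(dsc))]
    return mean(combined), sorted(referred)


# Toy values only, not taken from the study.
toy_dsc = [0.80, 0.40, 0.70, 0.30]
toy_uncertainty = [0.1, 0.9, 0.2, 0.8]
mean_dsc, referred_cases = refer_most_uncertain(
    toy_dsc, toy_uncertainty, refer_fraction=0.5, expert_dsc=0.60
)
print(mean_dsc, referred_cases)
```

In this toy example the two most uncertain cases are also the two with the lowest DSC, so referring them raises the average DSC; how well this holds in practice depends on how strongly uncertainty correlates with segmentation quality.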

Additional Qualitative Analysis
In this section, we present qualitative results for select cases in the MDA holdout dataset. Specifically, we describe our interpretations of model predictions and uncertainty maps relative to ground truth across multiple axial image slices, similar to how a case would be reviewed in the clinic. For simplicity, we only describe results of the MC Dropout Ensemble model. "High" and "low" values are relative to median values described in the main text (e.g., high DSC is greater than 0.61).
Case 1: Low performance and high certainty.
Here we describe a case with low DSC (0.43) but high certainty (certainty = −0.43). This corresponds to a patient with a human papillomavirus positive base of tongue T2N0M0 tumor as per American Joint Committee on Cancer 7th edition 4. Axial slice representations from superior to inferior slices for this case are shown in Supp Figure 4. As can be seen in the inferior slices (slice 54), there is initially a relatively large degree of uncertainty about the beginning of the prediction. Subsequently (slice 66), the model correctly predicts the tumor at the left base of tongue, with a simultaneous region of uncertainty appearing at the right base of tongue, likely secondary to the high PET signal causing a potential area of false positivity. This false positive PET signal is not ultimately included in the predicted segmentation mask, which in this case is seen as a desired outcome. More superiorly (slice 79), in terms of uncertainty and the resultant prediction, the model seems to have erroneously localized to the hyper-metabolic core of the primary tumor. Finally, at the most superior slices (slice 90), there was metal streak artifact induced by dental hardware, which may have interfered with model inference and subsequent uncertainty quantification, as no prediction was generated. The main takeaway from this case is that the model overemphasizes PET signal (as previously noted for PET/CT auto-segmentation models), which is also reflected in the resultant uncertainty measures. Moreover, the image artifact may also impact performance and uncertainty estimation.
Case 2: High performance and low certainty.
Here we describe a case with high DSC (0.64) but low certainty (certainty = −0.5). This corresponds to a patient with a human papillomavirus positive tonsillar T2N2M0 tumor as per American Joint Committee on Cancer 7th edition. Axial slice representations from superior to inferior slices for this case are shown in Supp Figure 5. In the inferior-most slices (slices 18-27), uncertainty is noted near the larynx, likely a byproduct of high PET signal. As before, this false positive PET signal is not ultimately included in the predicted segmentation mask, which in this case is seen as a desired outcome. More superiorly (slice 61), the model begins to predict a segmentation on only the right side of the base of tongue, when in reality the ground truth is a bilateral segmentation. Importantly, the model starts to note uncertainty on the contralateral part of the image, which is a desired outcome. As we move further superiorly towards the tonsils (slice 70), the tumor begins to exhibit an uncommon presentation (discontinuous fragment, bilateral in both tonsils), but the prediction starts to better approximate the ground truth; the uncertainty previously demonstrated at the contralateral side (left) is still present but has now started to become included in the predicted segmentation. Continuing superiorly (slice 78), there is still high uncertainty in the discontinuous fragment, but the model is able to generate a reasonable prediction; however, the model eventually starts to generate an implausible prediction in an air space (slice 85) as it erroneously begins to generate a bilateral segmentation. As with the previous case, towards the superior-most part of the image (slices 90-94), metal streak artifact induced by dental hardware may alter the predictions and uncertainty estimation; notably, the prediction ignores the false positive PET signal. The main takeaway from this case is that uncommon tumor presentations (e.g., fragmentation of the tumor from one continuous piece to two pieces) may present issues in generating predictions and uncertainty estimates. Moreover, as before, image artifacts may impact predictions and uncertainty estimation.
Case 3: Contralateral uncertainty.
Here we describe an interesting case with high DSC (0.64) and high certainty (certainty = −0.41). This corresponds to a patient with a human papillomavirus positive tonsillar T3N2M0 tumor as per American Joint Committee on Cancer 7th edition. Axial slice representations from superior to inferior slices for this case are shown in Supp Figure 6. At the inferior slice (slice 60), the model generates the prediction correctly at the left tonsil but starts to note uncertainty at the contralateral tonsil. Subsequently, at the more superior slice (slice 70), the contralateral portion is revealed as part of the ground-truth segmentation. The model is still uncertain about the area and ultimately does not include it as part of the prediction. In other words, the contralateral uncertainty indicates a false negative area that the model is uncertain about. In a clinical workflow, this would correspond to an area the clinician could choose to investigate further.

Case 4: Nodal uncertainty.
Here we describe an interesting case with high DSC (0.71) and high certainty (certainty = −0.44). This corresponds to a patient with a human papillomavirus positive base of tongue T1N2M0 tumor as per American Joint Committee on Cancer 7th edition. Axial slice representations from superior to inferior slices for this case are shown in Supp Figure 7. In the inferior-most slice (slice 22), there is noted uncertainty in the area of high PET signal (likely spurious signal), which is not included in the prediction, which in this case is seen as a desired outcome. More superiorly (slices 61-70), a metastatic lymph node is present on the right side of the image; there is corresponding noted uncertainty about this area and it is ultimately not included in the prediction. The model is able to generate a prediction for the right base of tongue tumor without issues. Notably, as observed through the majority of other cases, metastatic lymph nodes are normally not considered by the model at all, likely due to

Supp Fig. 1: Additional correlation analysis between model segmentation performance as dependent variable and model certainty as independent variable.
Segmentation performance is based on the Dice similarity coefficient (DSC), which is unitless, and model uncertainty is based on expected entropy, predictive entropy, mutual information, coefficient of variation, structure expected entropy, structure predictive entropy, structure mutual information, and DSC-risk. In the linear fit, the DSC is the dependent variable and uncertainty is the independent variable. Note that the y-axis is shared between the rows.
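As a rough illustration of three of the measures named in this caption, the sketch below computes expected entropy, predictive entropy, and mutual information for a single voxel from the foreground probabilities of several ensemble members. It uses the standard Bayesian deep learning definitions (mutual information as predictive entropy minus expected entropy); the exact definitions used in the main text, and the structure-level and DSC-risk measures, are not reproduced here. The function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in nats of a Bernoulli variable with foreground probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # limit of p*log(p) as p -> 0 is 0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def voxel_uncertainties(member_probs):
    """Uncertainty measures for one voxel.

    member_probs -- foreground probabilities, one per ensemble member
    """
    n = len(member_probs)
    mean_p = sum(member_probs) / n
    # Predictive entropy: entropy of the averaged prediction.
    predictive = binary_entropy(mean_p)
    # Expected entropy: average entropy of the individual members.
    expected = sum(binary_entropy(p) for p in member_probs) / n
    return {
        "expected_entropy": expected,
        "predictive_entropy": predictive,
        # Mutual information: disagreement between members.
        "mutual_information": predictive - expected,
    }


# Members that confidently disagree: all uncertainty is mutual information.
disagree = voxel_uncertainties([0.0, 1.0])
# Members that agree on an ambiguous voxel: mutual information is zero.
agree = voxel_uncertainties([0.5, 0.5])
print(disagree)
print(agree)
```

The two toy voxels show why the measures are not interchangeable: both have the same predictive entropy, but only the first reflects disagreement between ensemble members.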

Supp Fig. 2: Additional correlation analysis between model segmentation performance based on mean surface distance and model certainty.
Segmentation performance is based on mean surface distance (MSD) in millimeters and model uncertainty is based on expected entropy, predictive entropy, mutual information, coefficient of variation, structure expected entropy, structure predictive entropy, structure mutual information, and DSC-risk. The segmentation and uncertainty thresholds are drawn at the interobserver variability value of 2.5 mm MSD and at the predicted uncertainty value of the cross-validation 2.5 mm MSD, respectively. Note that the x-axis is shared between the columns.
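One way to read the two thresholds in this caption is as a binary quality check: a case counts as acceptable if its MSD is at or below the segmentation threshold, and it is predicted to be acceptable if its uncertainty is at or below the uncertainty threshold. The sketch below computes the fraction of cases where the two agree. This is an assumption about how such an accuracy could be computed, not a description of the authors' procedure; the function name, the uncertainty threshold, and the data are hypothetical toy values.

```python
def quality_prediction_accuracy(msd_mm, uncertainty,
                                msd_threshold_mm, uncertainty_threshold):
    """Fraction of cases where the uncertainty threshold predicts quality.

    msd_mm                -- per-case mean surface distance in millimeters
    uncertainty           -- per-case scalar uncertainty (higher = less certain)
    msd_threshold_mm      -- MSD at or below this value counts as acceptable
    uncertainty_threshold -- uncertainty at or below this value predicts
                             an acceptable segmentation (hypothetical value)
    """
    if len(msd_mm) != len(uncertainty) or not msd_mm:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = 0
    for msd, u in zip(msd_mm, uncertainty):
        actually_good = msd <= msd_threshold_mm
        predicted_good = u <= uncertainty_threshold
        if actually_good == predicted_good:
            correct += 1
    return correct / len(msd_mm)


# Toy values only; 2.5 mm is the MSD threshold quoted in the caption.
toy_msd = [1.0, 3.0, 2.0, 4.0]
toy_unc = [0.2, 0.8, 0.7, 0.9]
accuracy = quality_prediction_accuracy(toy_msd, toy_unc, 2.5, 0.5)
print(accuracy)
```

In the toy data, the third case has an acceptable MSD but high uncertainty, so it is the one misclassified case out of four.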

Supp Fig. 3: Additional correlation analysis between model segmentation performance based on Hausdorff distance at 95% and model certainty.
Segmentation performance is based on Hausdorff distance at 95% (95HD) in millimeters and model uncertainty is based on expected entropy, predictive entropy, mutual information, coefficient of variation, structure expected entropy, structure predictive entropy, structure mutual information, and DSC-risk. The segmentation and uncertainty thresholds are drawn at the interobserver variability value of 6.5 mm 95HD and at the predicted uncertainty value of the cross-validation 6.5 mm 95HD, respectively. Note that the x-axis is shared between the columns.

Supp Fig. 4: Additional qualitative investigation of a case with low performance and high certainty.
Number in top left corner = slice number; green dotted outline = ground-truth segmentation; red dotted outline = predicted segmentation. Blue, gray, and yellow colors in uncertainty maps correspond to low, medium, and high model uncertainty, respectively.

Supp Fig. 5: Additional qualitative investigation of a case with high performance and low certainty.
Number in top left corner = slice number; green dotted outline = ground-truth segmentation; red dotted outline = predicted segmentation. Blue, gray, and yellow colors in uncertainty maps correspond to low, medium, and high model uncertainty, respectively.

Supp Fig. 6: Additional qualitative investigation of a case with contralateral uncertainty.
Number in top left corner = slice number; green dotted outline = ground-truth segmentation; red dotted outline = predicted segmentation. Blue, gray, and yellow colors in uncertainty maps correspond to low, medium, and high model uncertainty, respectively.

Supp Fig. 7: Additional qualitative investigation of a case with nodal uncertainty.
Number in top left corner = slice number; green dotted outline = ground-truth segmentation; red dotted outline = predicted segmentation. Blue, gray, and yellow colors in uncertainty maps correspond to low, medium, and high model uncertainty, respectively.

Supp Table 1: PET/CT image acquisition parameters for the MDA external validation dataset.
All values were the same across all patients unless parentheses are shown, in which case the median and range are displayed. * applies only to CT data; ** applies only to PET data.

Supp Table 2: Demographic and clinical variable descriptive statistics for patients in the MD Anderson external validation cohort.
Values are shown as the absolute number of patients, except for age, which is shown as the median and range. All patients were human papillomavirus positive and staged with the seventh edition of the American Joint Committee on Cancer staging system.